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ABSTRACT 


The  purpoee  of  the  research  carried  out  under  this  contract 
has  been  the  development  of  mathematical  methods  and  computer 
programs  for  the  extraction  of  meaningful  Information  from 
biological,  primarily  neurophysiological,  measurements.  Emphasis 
has  been  placed  on  statistical  methods  suitable  for  separating  two  or 
more  random  signals  and  which  provide  insight  into  the  underlying 
mechanism  by  which  the  signals  are  generated.  Loeve -Karhunen 
expansion  and  Discriminant  Analysis  methods  are  applied  to  the 
problem  of  time  signal  classification.  Experiments  are  performed 
both  on  computer  generated  time  signals  and  on  electroencephalograms. 
Methods  of  coping  with  the  singularity  problem  arising  from  a 
small  sample  sise  are  invest' gated. 
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SECTION  I 


INTRODUCTION 


The  purpose  of  the  research  reported  in  this  and  a  previous 
volume  *  has  been  the  development  of  new  mathematical  methods  and 
computer  programs  for  the  extraction  of  meaningful  information  from 
biological,  primarily  neurophysiological,  measurements.  Emphasis 
has  been  placed  on  statistical  methods  suitable  for  separating  two  or 
more  random  signals  and  which  provide  insight  into  the  underlying 
mechanism  by  which  the  signals  are  generated. 

The  previous  efforts  were  largely  concentrated  on  the  Loeve- 
Karhunen  decomposition  of  continuous  signals.  The  work  covered  by 
this  report  is  directed  toward  an  evaluation,  extension  and  modifi¬ 
cation  of  this  method. 

In  attempting  to  extract  meaningful  information  from  biological 
measurements,  several  fundamental  operations  are  often  required. 
They  are 

Compre ssion:  i.e.  ,  the  reduction  of  the  amount  of  information 
which  must  be  stored,  consistent  with  the  need  for  subsequent 
processing  or  reconstitution 

Clustering:  i.e.  ,  the  separation  of  a  set  of  measurements  into 
subsets,  the  number  of  which  may  or  may  not  be  predetermined, 
on  the  basis  of  some  measure  of  similarity 

Classification:  i.e.,  the  operation  of  assigning  a  measurement 
to  one  of  a  number  of  existing  classes. 

In  this  research,  "biological  measurements"  are  assumed  to  be 
time  signals,  i.e.,  continuous  scalar  functions  of  time,  having  a 
known  bandwidth.  The  continuous  signals  can  be  replaced,  without 
loss  of  generality,  by  discrete  time  series  produced  by  appropriate 
sampling  techniques. 

To  provide  insight  into  the  choices  of  methodology  and  under¬ 
standing  of  assumptions  that  must  be  satisfied  before  significant 
solutions  can  be  expected,  some  of  the  differences  between  biological 
signals  and  those  classes  of  signals  more  frequently  encountered  in 
classification  problems  should  be  considered. 
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Moat  classification  problems  which  have  been  solved  (e.  g.  , 
communication  problems,  optical  character  recognition  problems) 
deal  with  a  finite  number  of  deterministic  signals  which  are  con¬ 
taminated  primarily  by  additive  noise.  In  contrast,  many  biological 
signals  (e.g. ,  the  spontaneous  electroencephalogram)  have  no 
deterministic  component  and,  therefore,  can  only  be  described  in 
statistical  terms.  If  an  attempt  is  made  to  define  the  statistical 
parameters  by  averaging  a  number  of  signals  and  those  signals  have 
been  significantly  contaminated  by  nonadditive  noise  (such  as  multi¬ 
plicative  noise,  random  time  base  compression  or  random  phase  shift), 
the  averaging  may  eliminate  the  signal.  If  systematic  distortions  have 
been  introduced  in  the  measurement  proce ss,  the  signal-to-noise  ratio 
may  actually  be  worsened  by  averaging. 

Another  problem  arising  in  processing  biological  signals  is 
whether  the  desired  information  is  at  all  present  in  the  signal  being 
processed.  Often  it  will  be  necessary  to  assume  the  information  re¬ 
sides  therein.  A  negative  result  can  therefore  mean  either  that  the 
information  is  not  present  or  that  the  retrieval  technique  is  deficient. 

The  sample  size  may  be  relatively  small  in  biological  processing 
problems,  because  of  the  difficulty  in  making  measurements  or  finding 
sufficient  numbers  of  specimens  of  a  certain  class.  This  limitation 
may  affect  the  choice  of  decision  procedures. 

Most  of  the  methods  described  in  this  report  are  based  on  linear 
operations,  which  are  based  on  the  minimization  of  some  mean- square 
error  criterion.  These  limitations  have  been  imposed  for  the  sake  of 
mathematical  and  computational  simplicity.  Any  explicit  or  implicit 
claim  of  optimality  should  be  interpreted  in  the  context  of  these  limi¬ 
tations. 

The  potential  advantages  of  using  computers  to  compress,  cluster, 
or  classify  biological  data  may  seem  too  obvious  to  warrant  discussion. 
However,  for  completeness  the  following  points  are  made: 

Comp  re  s  si  on 

It  appears  entirely  feasible  to  replace  simple  or  multiple  time 
signals,  or  pictures,  by  sets  of  numbers  such  that  a)  any  processing 
can  be  based  on  these  numbers  in  lieu  of  the  original  signals,  or  b) 
the  original  signals  can  be  reconstituted  at  will  to  any  predetermined 
degree  of  accuracy.  The  primary  advantages  are  the  convenience  and 
economy  of  transmitting  and  storing  a  few  numbers  instead  of  time 
signals  or  pictures. 


2 .  Clustc  ring 


Clustering  routines  are  useful  for  locating  and  defining  subsets 
within  a  measurement  set  on  the  basis  of  commonly  held  attributes 
which  may  not  be  apparent  from  visual  inspection.  The  information 
derived  about  the  nature,  location,  and  quantity  of  these  '’clusters" 
may  provide  a  basis  for  the  formulation  of  hypotheses  regarding  the 
underlying  mechanism  by  which  the  signals  have  been  generated.  In 
some  cases  it  will  be  possible  to  test  these  hypotheses  by  carrying 
out  classification  experiments  on  additional  data. 

3.  Classification 


The  automatic  classification  of  biological  data  can  be  considered 
at  several  levels  of  difficulty.  At  the  simpler  level  are  applications 
in  which  discrimination  is  required  only  between  good  records  and 
those  containing  gross  errors  or  artifacts,  or  between  records  with 
or  without  some  large,  well-defined  occurrence.  Some  of  these 
rudimentary  but  important  applications  are  well  within  the  capabilities 
of  the  methods  described  herein.  The  classifying  performance  on  the 
basis  of  progressively  more  subtle  attributes  will  have  to  be  evaluated 
on  the  particular  data  classes  of  interest.  In  the  more  complex 
problems,  the  methods  presented  here  may  be  useful  components  of 
a  more  comprehensive,  as  yet  undeveloped,  methodology. 


3 


SECTION  II 

EVALUATION  OF  LOE VE -KARHUNEN  METHODS  AS  APPLIED 
TO  THE  ANALYTICAL  CLASSIFICATION  OF  BIOLOGICAL  DATA 


1 .  Integral  Equation  Formulation 

Loe ve -Karhunen  methods  are  baaed  on  certain  desirable  proper 
ties,  (i.e.,  orthogonality,  completeness,  efficiency)  of  the  eigen* 
functions  of  linear  integral  equations  having  symmetric  kernels.^ 
Karhunen^  and  Loeve*  applied  this  theory  to  the  decomposition  and 
discrete  representation  of  random  time  functions,  in  which  case  the 
kernel  is  the  autocovariance  function  of  the  random  process.  The 
methods  have  found  practical  application  in  the  design  of  communi¬ 
cation  and  radar  detection  systems. 

These  methods  completely  specify  the  design  of  a  set  of  linear 
filters  that  are  "optimal"  (in  a  sense  that  will  be  described  below) 
for  analyzing  the  random  process  in  question.  These  filters  have 
weighting  functions  matched  to  the  eigenfunctions  corresponding  to 
the  largest  eigenvalues.  The  number  of  filters  required  in  any  appli 
cation  can  be  readily  determined  as  a  simple  function  of  the  magni¬ 
tudes  of  the  eigenvalues.  If  it  is  necessary  to  analyze  a  random 
process,  the  statistics  of  which  are  stationary  but  unknown]  the 
following  procedure  is  indicated: 

1.  Record  a  statistically  representative  sample  of  the  process 

2.  Compute  the  eigne  values  and  eigne  functions  of  a  sample 
covariance  matrix  of  the  process 

3.  Adjust  the  filter  bank,*  in  accordance  with  the  above. 

The  filter  bank,  then  will  more  efficiently  analyze  that  random 
process  than  any  band-pass  filter  bank,  or  any  other  linear  filter 
having  the  same  number  of  paths. 

If  the  statistics  of  the  process  change  slowly  and  smoothly  with 
time,  periodic  recomputation  and  adjustment  provide  the  basis  for 
an  efficient  time  varying  analyzer. 


♦  The  term  "filter  bank"  is  used  for  simplicity,  although  it  is  antici¬ 
pated  that  the  filter  will  usually  be  simulated. 
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The  efficiency  of  the  L-K  method  with  respect  to  the  analysis  or 
compression  problem  can  be  Illustrated  as  follows:  Consider  a  sero- 
mean  random  proce  ss  (x  (t)  }  » 0  s  t  sT, 

Any  sample  function,  x  (t),  can  be  described  without  loss  of 
information  by  k  numbers,  according  to  Shannon's  Sampling  Theorem, 
where  k  *  2  wT,  w  *  bandwidth  of  {x  (t)  },  and  where  the  k  numbers 
are  the  sampled  values  of  the  signal  at  time  intervals  l/(2w). 

Alternatively,  any  such  signal  can  be  described  to  any  desired 
degree  of  accuracy  by  n  numbers,  (n  to  be  determined  as  described 
below)  according  to  Loeve -Karhunen  theory,*  where  the  numbers  are 
given  by 

T 

Sj  *  y  x  (t)  t|i  (t)  dt,  J»  1 ,  2,  • .  * ,  n.  (1) 

0 


(t)  is  the  eigenfunction  corresponding  to  the  j'th  largest 


#i  (t)  is  the  eigenfunction  correspond! 
eigenvalue,  \  of  the  integral  equati 


on 


T 

J  (t,  T)  4-  (“0  dT  *  k  i|i  (t),  0  StsT 
0 

where 


Kx  (t,  T)  *  E  [ x(t)  x(r)] 


(2) 


(3) 


*  This  neglects,  for  the  time  being,  the  storage  required  for  the 
eigenfunctions. 
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Than,  x(t)  is  said  to  "optimally  approximated"  in  the  sense  that 
the  error  of  approximation, 


le  minimised  for  any  value  of  n  by  choosing  the  #  (t) 1  s  and  a.*s  to 
be  the  solutions  of  equations  (2)  and  (1) ,  respectively. 

The  value  of  n  can  be  readily  determined  as  a  function  of  the 
tolerable  approximation  error  in  terms  of  the  eigenvalues  since 


Therefore,  for  a  given  random  process,  x(t),  and  a  tolerable 
error  of  approximation,  a  readily  determined  number  of  eigenfunctions 
must  be  computed  and  stored.  Using  these  eigenfunctions  as  a  basis 
for  analysing  i  signals  subsequently  received  from  the  random  process, 
the  following  comparison  can  be  made: 

"Shannon"  representation  1  x  k  numbers 

L-K  representation  I  x  n 

(  numbers 

+  n  x  k 

where  the  I  x  n  numbers  are  the  coefficients  determined  by  (1)  and 
the  n  x  k  numbers  are  the  sampled  representation  of  the  eigenfunctions. 
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2.  Geometrical  Interpretation 


The  approximation  problem  has  the  following  geometrical 
interpretation.  Considering  the  random  process  x(t)  to  be  a  random 
vector  $  in  an  m -dimensional  vector  space  r»  the  approximation 
x(t)  *  at  ^i  ^  *■  n -dimensional  sub  space  of  TV  which 

shall  be  denoted  by  T  -  The  approximation  problem  is  then  the 
problem  of  finding  the  orientation  of  Tn  which  minimises  the  approxi¬ 
mation  error  8  =  E  |j  x  -  x  |aj 

It  is  readily  shown/*  that  6 is  minimised  by  finding  the  orien¬ 
tation  of  r  which  on  the  average  maximises  the  squared  length  of 
the  projec&on  of  x. 

Although  the  Loeve-Karhunen  theory  has  been  developed  in  the 
context  of  integral  equations  and  continuous  time  signals,  any  digital 
implementation  of  the  method  necessarily  demands  that  the  continuous 
signal  be  replaced  by  a  time -sampled  representation.  As  noted  above, 
this  replacement  can  be  accomplished  with  no  loss  of  information  if 
an  upper  bound  on  the  bandwidth  of  the  signal  is  known,  a  condition 
which  certainly  can  be  satisfied  in  any  class  of  biological  measure¬ 
ments  of  interest.  Therefore,  for  all  practical  purposes  Loeve- 
Karhunen  Analysis  can  be  considered  equivalent  to  its  discrete 
analog,  which  is  well  known  in  jpulti variate  statistical  theory  as 
Principal  Component  Analysis. 

3.  Matrix  Formulations 


Principal  Component  Analysis  is  concerned  with  the  properties 
of  the  eigenvalue  a  and  eigenvectors  of  a  scatter  matrix  which  can  be 
construed  to  be  the  autocovariance  matrix  of  a  time  series.  (In 
Ref.  1  this  matrix  is  also  referred  to  as  the  "density  matrix"  of  the 
autocorrelation  function. )  This  analogous  representation  can  be 
stated  as  follows:  Given  a  sero  mean  random  process  (x  (t)  }, 

0  <  t  <  T,  sampled  at  M  equal  intervals  so  that  x^.  *  x^  (iT/M). 

An  autocovariance  matrix  can  be  computed  with  elements 
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(7) 


Alj  *  M 


Then  the  matrix  equation 

.a  i  ■  4  (*> 

is  analogous  to  (2)  and  the  eigenvectors  ^  and  eigenvalues  V  enjoy 
properties  analogous  to  those  described  above. 

Principal  Components  Analysis  is  more  widely  known  than  L-K  ^  ^ 
Analysis  and  has  been  applied  to  the  processing  of  biological  signals.  * 

A  semantic  difficulty  exists  in  that  a  variety  of  names  are  used  by 
various  researchers  in  referring  to  identical  or  similar  techniques. 

In  addition  to  Loeve -Karhunen  and  Principal  Components,  Principal 
Factors,  Discriminant  Analysis  and  others  are  mentioned  in  the 
literature.  In  Section  Vlll  an  attempt  will  be  made  to  clarify  the 
relationship  between  the  various  techniques, 

4.  Experimental  Results 

A  problem  which  continually  arises  when  evaluating  various 
techniques  for  analyzing  and  classifying  signals  of  biological  origin 
is  the  following:  Are  the  errors  caused  by  the  shortcomings  of  the 
technique;  or  because  the  signals  were  incorrectly  labeled,  or  be¬ 
cause  the  effect  of  the  attribute  being  studied  is  dominated  by  some 
other  variable  attribute,  or  insufficient  training  set,  etc. 

As  an  effort  to  decouple  these  errors,  the  strategy  followed  has 
been  to  generate  random  signals  for  preliminary  checking  of  each 
evaluation  procedure.  The  ability  of  the  procedure  to  correctly 
classify  these  synthetic  random  signals  gives  no  index  of  its  effective¬ 
ness  or  genuine  biological  signals,  but  clearly  is  a  necessary  con¬ 
dition  for  the  same . 

The  evaluation  procedure  followed  is  as  shown  schematically  in 
Fig.  1.  The  ensembles  {xj  *it)  }  and  {xj  B  (t)  }  are  iynthe sized  in 
the  computer  (IBM  7094).  Each  sample  function  of  these  ensembles 
is  measured  in  two  ways:  the  measurements  shown  here  are  l)  the 
power  content  in  various  frequency  bands  Pj  and  2)  the  Loeve  - 
Karhunen  coefficients,  c^  .  We  desire  a  figure  of  merit  which 


I 

k=l 


xki  v 
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indicates  the  discriminating  power  of  the  alternate  measurement 
sets  relative  to  the  input  ensemble  {x^  ( t ) } .  *  This  figure  of  merit 
is  defined  as  shown  on  Fig.  1 ,  where  p  and  0 T are  measured  along  the 
Z  axis.  The  Z  axis  has  been  located  by  a  discriminant  analysis 
technique  to  provide  the  optimal  separation  of  the  projections  of  the 
two  sample  clusters. 

Such  experiments  were  conducted  on  two  groups  of  computer¬ 
generated  time  series  and  on  spontaneous  EEG's  of  one  subject 
with  eyes  alternately  open  and  closed.  The  results  are  shown  in 
Table  I. 


TABLE  1 


EXPERIMENTAL  RESULTS 


C  ompu  te  r  -  ge  ne  ra  te  d 

Spontaneous  EEG 

Figure  of  merit 

time  series 

eyes  open/eyes  closed 

Band-pass  powers 
(six  measurements) 

8.7 

3.4 

L-K  analysis 
(six  measurements) 

11.  3 

4.7 

L-K  analysis 
(one  measurement) 

9.2 

2.8 

fThe  power  measurements  were  made  with  respect  to  the  six 
frequency  bands:  1.  5-  .  5,  .  5-7.  5,  7.  5-9.  5,  9.  5-12.  5, 

12.  5-17.  5,  and  17.  5-25  cps;  which  is  one  of  the  many  par¬ 
titionings  of  the  frequency  spectrum  which  have  been  employed 
by  EEG  researchers. 


♦See  Reference  25  for  discussion  of  this  figure  of  merit. 
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5.  Shape  of  the  Eigenfunction* 


The  Loeve-Karhunen  (or  Principal  Component)  Analysis  may 
be  thought  of  as  a  generalised  spectral  representation  of  a  random 
process.  In  this  generalized  representation  the  components  are  not 
limited  to  the  family  of  sinusoids  as  in  Fourier  analysis  but  instead 
are,  as  has  been  described,  chosen  on  the  basis  of  economical 
approximation. 

In  addition  to  the  efficiency  of  the  L-K  representation,  another 
less  well-known  advantage  may  result  from  the  unique  properties  of 
the  method.  This  advantage  follows  from  the  fact  that  the  method 
systematically  finds  the  largest  constituents  of  a  random  signal, 
subject  to  the  constraint  that  their  coefficients  be  uncorrelated.  If 
then,  an  assumption  of  gaussianness  can  be  justified,  the  eigen¬ 
functions  can  be  conceived  as  outputs  of  (statistically)  independent 
mechanisms.  Therefore,  a  research  worker  confronted  with  the 
problem  of  analyzing  a  complex  waveform  comprised  of  the  sum  of 
waveforms  generated  by  a  number  of  independent  sources  may  be 
able  to  associate  physical  meanings  with  the  various  eigenfunctions. 

Figures  2  and  3  show  the  first  ten  eigenfunctions  of  the  computer¬ 
generated  signals  and  the  EEG's,  respectively.  It  is  interesting  to 
note  that  the  eigenfunctions  of  the  EEC  are  not,  in  general,  sinusoidal 
but  do  exhibit  periodicities  closely  related  to  the  well-known  alpha, 
beta,  and  theta  frequencies  which  have  been  postulated  on  the  basis 
of  Fourier  analyses.  Figure  4  shows  the  autocorrelograms  from 
which  the  functions  of  Figure  3  were  derived. 

6.  Inadequacy  of  L-K  for  Classification 

Despite  the  advantage  indicated  previously  relative  to  ths  con¬ 
ventional  methods,  Loeve-Karhunen  analysis  is  clearly  suboptimal 
with  respect  to  the  classification  problem,  since  it  does  not  make 
use  of  available  information  concerning  the  correlations  within  the 
several  classes.  In  order  to  clarify  this  point,  consider  the  hypo¬ 
thetical  set  of  data  represented  in  Figure  5  in  which  each  point 
represents  a  sample  time  function  c.  the  state  indicated,  and  three 
dimensions  of  an  m-dimensional  vector  space  are  shown. 

As  described  above,  the  effect  of  Loeve-Karhunen  analysis  is 
to  define  a  sub  space  of  any  dimension  n  <  m  such  that  the  sample 
points,  when  projected  on  this  subspace,  are  dispersed  maximally 
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Fig.  3  Eigenfunctions  of  spontaneous  electroencephalograms 
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in  the  mean- square  sense,  without  regard  to  any  inherent  grouping 
of  the  points.  It  follows,  therefore,  that  L-K  treatment  of  the  data 
as  shown  in  Figure  5  will  be  the  same  whether  the  purpose  of  the 
processing  is  to  infer  the  state  of  the  eyes,  or  the  level  of  alertness 
It  is  evident  that  an  optimal  method  would  necessarily  be  a  function 
of  the  use  for  which  it  is  intended.  Various  means  of  accomplishing 
this  are  discussed  in  the  following  section. 


Eyes  Open 

Eyes  Closed 

Ale  rt 

X 

• 

Drowsy 

• 

• 

Fig.  5  Geometrical  Representation  of  Data 
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SECTION  in 


DISCRIMINANT  ANALYSIS 

1.  Discussion 

Discriminant  Analysis  diffsrs  from  L-K  physically,  in  that  it 
assumes  the  existence  of  a  set  of  labeled  data,  i.  e. ,  data  whose  correct 
class  membership  is  known.  This  class  information  is  exploited  in 
that  the  sample  covariance  of  the  subgroups  are  computed  as  well  as 
that  of  the  aggregate.  Then,  formally,  where  L-K  or  Principal 
Components  is  based  on  the  solutions  of  the  simple  eigenvalue  problem 

C  A  -  XO  =  0  , 

Discriminant  Analysis  is  based  on  the  solutions  of  a  generalised 
eigenvalue  of  the  form 


[  B  -  XC  ]  £=  0  , 

where  B  and  C  are  functions  of  the  aggregate  and  subgroup  covariances 
respectively,  the  precise  definitions  of  which  will  be  presented  in 
Section  VIII. 

The  solution  of  the  simple  problem  defines  a  subspace  which 
maximizes  the  scatter*  of  the  sample  points.  Th<'  solution  of  the 
generalized  problem  defines  a  subspace  which  maximizes  the  scatter 
of  the  groups,  keeping  the  within  group  scatter  constant. 

In  contrast  to  the  simple  problem,  the  solution  of  the  generalized 
problem  requires  matrix  inversion.  Therefore  a  singularity  problem 
arises  when  the  sample  size  is  less  than  the  dimensionality  of  the 
original  space. 

We  have  evaluated  experimentally  the  following  three  different 
means  of  circumventing  the  singularity  problem.  The  use  of  the 
Moore -Penrose  Generalized  Inverse  for  inverting  the  singular 
within  sample  scatter  matrix;  the  ’H'  inverse,  a  method  suggested 


*  See  Section  VIII  1  for  the  definition  of  "scatter.  " 
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by  T.  J.  Harley  *r.H  described  in  Section  VIII  3;  and  finally  the  method 
suggested  by  us  of  using  the  L-K  method  to  reduce  dimensionality  and 
then  applying  D.  A.  The  D.  A.  program  has  been  provided  with  the 
option  of  taking  the  'H'  inverse  and  th£  program  described  as  program 
G  accomplishes  L-K  followed  by  D.  A. 

On  the  basis  of  the  experiments  we  have  performed,  L-K  followed 
by  D.  A.  gave  best  results.  As  explained  in  Section  VIC  2,  D.  A.  can 
also  be  accomplished  via  prewhitening  followed  by  K-L. 

2.  Experimental  Results 

The  following  experiment  was  conducted  on  a  set  of  60  visually  evoked 
EEGs  recorded  at  the  Mayo  Clinic.  Thirty  samples  were  recorded 
with  subject's  eyes  open,  30  with  eyes  closed.  Each  set  of  30  was 
divided  so  that  samples  1,  3,  5,  .  . . ,  29  comprise  an  analysis  set  and 

samples  2,  4,  6 . 30  comprise  a  test  set,  where  the  numbers 

indicate  the  time  sequence  of  occurrence.  (This  division  of  the 
samples  was  chosen  to  minimise  the  effect  of  any  nonstationarity  of 
the  data.  )  Using  various  methods,  then,  the  analysis  data  were  used 
to  define  the  directions  of  the  corresponding  lines  on  which  sample 
points  could  be  projected  from  the  original  95  point  vector  space.  The 
separation  of  the  test  data  when  projected  on  one  or  another  of  these 
lines,  provides  a  measure  of  the  efficacy  of  the  respective  methods 
by  which  the  directions  are  defined. 


Figure  6  shows  the  result  of  projecting  the  analysis  and  test  data, 
first  on  the  line  by  L-K  Analysis,  and  then  on  the  line  defined  by 
Discriminant  Analysis.  Because  of  the  singularity  problem,  the 
Discriminant  Analysis  is  performed  on  the  data  after  it  has  been 
reduced  from  95  to  25  dimensional  space  L-K. 

Figures  8  through  11  represent  the  data  used  in  this  experiment 
and  the  marked  signals,  i.  e.  ,  records  2,  10,  12,  15  for  eyes  open 
test  data  and  record  5  for  the  eyes  closed  test  data,  represent  the 
errors  indicated  on  Figure  6. 


♦The  computer  programs  are  described  and  listed  in  reference  27. 
Copies  of  the  programs  may  also  be  obtained  from  Mathematics 
and  Analysis  Branch,  Aerospace  Medical  Research  Laboratories, 
Wright -Patters  on  Air  Force  Base,  Ohio. 
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The  reaulte  demonstrate  the  characterietlc  tendency  of  the  L-K 
projection  to  be  maximally  dispersed  without  necessarily  effectively 
separating  the  classes  of  Interest.  The  D.  A. ,  on  the  other  hand, 
demonstrates  the  tendency  to  separate  the  classes  of  Interest. 

Figure  7  Illustrates  the  case  when  L-K  analysis  would  Indicate 
a  line  of  projection  which  does  not  effectively  separate  the  two  classei 
but  D.  A.  does  find  the  appropriate  line  of  Interest.  If  the  two  cluster 
centers,  In  this  example,  were  far  apart  L-K  would  indicate  an 
appropriate  line  of  projections. 

A  third  experiment  was  conducted  which  verified  the  formal 
equivalence  between  D.  A. ,  and  prewhltenlng  followed  by  L-K. 
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SECTION  IV 


EVALUATION  OF  DECISION  PROCEDURES  -  BAYES  VS 
MINIMUM  DISTANCE 


The  Biyen  Decision  Procedure  has  certain  desirable  properties, 
such  as  the  ability  to  nonlinearly  partition  the  vector  space  and  to 
provide  minimum,  expected  loss  decisions  under  certain  conditions. 
However,  its  implementation  requires  either  the  fitting  of  multi¬ 
dimensional  density  function,  or  the  discretisation  of  the  measurement 
space  and  the  empirical  determination  of  M  (NK  +  1)  probabilities, 
where  M  is  the  number  of  classes,  N  is  the  number  of  measurements, 
and  K  is  the  number  of  quantization  levels.  It  is  difficult  to  predict 
the  variance  of  these  tabulated  probabilities,  but  it  may  be  noted  that 
many  hundreds  of  samples  are  usually  used  in  making  such  a  tabu¬ 
lation  in  character  recognition  work. 

Assuming  that  such  sample  sizes  may  often  be  prohibitive  in 
biological  work,  an  attempt  has  been  made  to  evaluate  Bayes  relative 
to  other  decision  methods  for  small  samples,  and  to  provide  an 
alternate  decision  program  for  such  cases. 

The  Bayes  decision  program  was  tested  on  the  six -dimens tonal 
sample  points  produced  by  the  power  band  and  the  L-K  measurements 
of  the  computer -generated  time  series.  As  a  consequence  of  the 
small  number  of  samples  available  for  training  (five  for  each  class  in 
each  case),  only  two  of  ten  test  samples  could  be  classified  correctly 
in  one  case  and  none  of  ten  in  the  other;  i.  e.  ,  no  decision  in  18  of 
20  test  samples.  Although  this  sample  sise  is  extremely  small,  the 
point  is  that  simple  decision  procedures  based  on  distance  measures 
can  discriminate  reliably  under  such  conditions.  For  example,  in 
this  case,  the  sign  of  the  first  L-K  coefficient  is  a  sufficient  criterion, 
with  ample  margin  of  safety,  to  classify  all  the  test  samples.  The 
results  indicate  that  an  alternative  decision  procedure  is  needed  for 
problems  involving  small  sample  size.  The  discriminant  analysis 
procedure  described  in  the  previous  section  is  one  such  approach. 

This  approach,  when  used  in  conjunction  with  some  algorithm  for 
setting  threshold  and  a  means  of  circumventing  the  singularity  problem, 
is  recommended  for  small  sample  decision  problems. 
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SECTION  V 


EVALUATION  OF  POWER  SPECTRUM  PROGRAM 


In  the  program  supplied  with  Reference  1,  the  power  spectrum 
is  obtained  by  computing  the  sample  autocorrelation  function  and  then 
taking  its  Fourier  transform.  This  program  was  used  many  times 
during  this  year's  experimentation,  both  on  synthetic  time  series  and 
on  spontaneous  EEGs.  Investigation  of  certain  spurious  components 
of  the  computed  spectra  indicated  the  advisability  of  making  the 
following  restriction  and  revision: 

1)  The  autocorrelation  function  p  (y)  should  not  be  computed  for 
lags  greater  than  10%  of  the  interval 

2)  A  "gate  function"  g  (y)  ,  providing  more  smoothing  than  does 
g  (y )  s  1 ,  0  <  y  <  X  .  must  be  used.  Good  results  have  been  obtained 
with  the  "hamming"  function^. 

With  these  changes,  the  program  gave  generally  satisfactory 
results  except  that  the  spectra  thus  derived  are  still  extremely 
sensitive  to  sampling  error  in  the  autocorrelogram.  This  sensitivity 
has  been  commented  on  by  other  investigators***  An  alternative 
method  of  power  spectrum  determination  exists  which  does  not  re¬ 
quire  the  autocorrelation  function,  and  which  therefore  may  yield 
more  consistent  results*^.  This  "direct"  method  was  used,  successfully, 
to  compute  the  spectra  of  some  EEG  data.  However,  it  was  decided 
not  to  carry  out  any  comparative  evaluation  of  the  two  methods  but, 
rather,  to  attempt  to  develop  a  testing  procedure  which  eliminates  the 
need  for  computing  the  power  spectrum  altogether.  The  results  of 
this  effort  are  reported  in  the  following  section. 
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SECTION  VI 


TIME-DOMAIN  PROCESSING  OF  STOCHASTIC  SIGNALS 


Certain  classes  of  neurological  signals  are  stochastic:  i.  e. , 
unpredictable  in  detail,  with  only  certain  statistical  properties  of 
the  signal  knowr  Spontaneous  EEGs  exemplify  such  signals.  The 
methods  of  orthogonal  analysis  and  classification  developed  under 
this  contract  do  not  apply  naturally  to  stochastic  signals,  since  the 
method  of  generating  coefficient  assumes  the  signal  to  have 
deterministic  content  with  respect  to  some  point  of  time  reference. 

A  standard  procedure  of  transforming  stochastic  signals,  if  stationary, 
into  deterministic  signals  is  to  compute  the  power  spectrum  or  auto¬ 
correlation  function,  and  then  to  use  that  as  the  signal  to  be  analysed 
rather  than  the  original  time  function.  (This  was  the  procedure 
suggested  in  Ref.  1.)  This  artifice  has  the  disadvantages  of  consuming 
computer  time,  introducing  additional  error,  and  producing  eigen¬ 
functions  which  are  not  interpretable  in  terms  of  the  time  functions. 

A  time-domain  procedure  which  eliminates  these  shortcomings 
has  been  derived  and  programmed  (see  Sections  VIII  6).  This 
procedure  can  be  implemented  in  real  time  via  a  bank  of  filters 
matched  to  the  eigenfunctions  used  in  conjunction  with  delay  lines, 
multipliers  and  integrators;  or  via  a  relatively  simple  digital  program. 
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SECTION  vn 


EXPERIMENTAL  FACILITY  FOR  COLLECTING  EEG  DATA 


In  order  to  obtain  empirical  data  upon  which  these  analyses  might 
be  tested,  equipment  was  designed  and  built  to  elecit  responses  from 
the  human  visual  cortex  by  photic  stimulation.  By  placing  small,  high 
powered  collimating  and  Maxwellian  lenses  close  to  a  glow  modulator 
tube  (Sylvsnia  R1166),  it  is  possible  to  produce  a  uniform  field  of 
72,000  feet.  Lamberts  covering  a  60°  solid  angle  (Fig.  13).  This  is 
suffi  ciently  bright  that  a  beam  splitter  reducing  the  brightness  to 
24,  000  feet.  Lamberts  might  be  used  to  provide  the  option  of  super¬ 
imposing  another,  independent  field  on  the  first,  and  tu  permit 
monitoring  of  the  two  fields  by  photocell.  The  second  field  is  at 
present  being  used  to  supply  a  fixation  point  consisting  of  a  15-45 
Pinlite  (Kay  Electric),  reduced  in  intensity  and  reddened  by  a  series, 
variable  resistor. 

The  advantages  of  the  Maxwellian  view  are  threefold:  (a)  it 
directs  into  the  eye  all  of  the  light  incident  within  its  area  instead  of 
scattering  it  as  a  reflecting  surface  or  ground  glass  would  do;  (b)  by 
concentrating  most  of  the  rays  of  light  on  the  center  of  the  pupil,  it 
increases  their  effectiveness  by  virtue  of  the  Stiles -Crawford  effect; 

(c)  but  most  important,  it  focuses  an  image  of  the  source  within  the 
2  millimeter  area  in  the  center  of  the  pupil  within  which  the  pupil 
cannot  constrict;  thereby  eliminating  variations  of  retinal  illuminance 
resulting  from  pupillary  oscillations.  It  does  present  a  problem  in 
measuring  the  equivalent  luminance,  however,  especially  in  this  case 
where  the  lens  is  only  about  7  millimeters  from  the  cornea.  The 
solution  adopted  here  was  to  occlude  half  of  the  collimating  lens  and 
substitute  in  its  place  light  incident  through  the  second  half  of  the 
beam  splitter.  Once  the  two  have  been  matched,  the  second  field  can 
be  measured  by  conventional  means. 

The  luminous  flux  of  a  glow  modulator  is  a  nearly  linear  function 
of  its  anode  current,  but  along  with  this  change  in  flux  is  a  change  in 
hue.  Consequently,  for  visual  experiments  it  is  best  not  to  vary  the 
anode  current  for  strictly  comparable  results.  This  problem  was 
circumvented  in  the  present  system  by  delivering  the  light  in  90  micro¬ 
second  pulses  of  uniform  shape  and  amplitude.  By  Bloch's  law  the 
effect  of  a  given  amount  of  light  energy  is  independent  of  its  distribution 
over  time,  up  to  a  certain  critical  duration.  Thus  the  apparent 


27 


Fig.  12  Control  Circuit  for  Glow  Modulator. 


Fixation  Rjikt 


Fig.  13  Optical  Stimulus 


brightness  can  be  varied  by  varying  the  number  of  pulses  delivered 
within  this  critical  duration,  which  varies  from  10  to  100  milliseconds 
depending  upon  conditions.  Thus,  the  equivalent  luminance  can  be 
varied  over  a  range  of  two  to  three  log  units  by  this  means,  and  the 
range  can  be  shifted  about  by  the  variable  resistor  connected  to  the 
cathode  of  the  pentode  in  Fig.  12.  The  range  can  also  be  extended 
downward  another  0.  6  log  units  by  delivering  the  pulses  to  the  grids 
of  the  pentode  with  a  device  having  faster  rise  time  than  the  Tektronix 
161  pulse  generator  used  here.  The  tise  time  of  the  light  output  of 
the  glow  modulator  itself  is  under  20  microseconds,  and  the  time 
constant  of  decay  is  less  than  2  microseconds.  (This  is  measured 
with  a  1P42  photocell  in  series  with  a  Tektronix  545  oscilloscope, 
oscilloscope,  shunted  by  a  10K  resistor,  a  very  fast  system  with 
spectral  response  approximately  that  of  the  human  eye. ) 

The  glow  modulator  and  accessory  equipment  are  mounted  on 
a  three -coordinate  manipulator  for  easy  positioning  of  the  image 
within  the  pupil.  The  subject  is  provided  with  adjustable  bit-board 
and  two -point  head  rest. 

The  evoked  response  is  recorded  from  the  scalp  one  inch  above 
and  one  inch  to  the  right  of  the  occipital  protuberance,  with  the  left 
ear  lobe  as  reference.  The  data  logging  system  has  been  described 
in  the  draft  report  submitted  to  the  contracting  agency  at  the 
termination  of  the  previous  contract  year. 
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SECTION  VIII 

MATHEMATICAL  REPORT 


1.  Discriminant  Analytic  Its  Theoretical  Justification  and  Relation 
to  Loevc -Karhuncn  Analysis 


The  basic  assumption  underlying  probabilistic  classification 
into  classes  is  that  there  exists  for  each  class  a  multivariate  proba¬ 
bility  distribution,  the  member  (x| ,  . .. ,  x^)  of  class  i  being  con¬ 
sidered  a  sample  from  a  population  which  is  distributed  in  a  k- 
dimensir  ial  space  according  to  a  certain  probability  distributions. 
Our  approach  to  the  problem  is  that  the  only  knowledge  we  have 
about  the  distributions  is  that  which  can  be  inferred  from  training 
samples  and  no  assumptions  are  made  about  the  functional  forms  of 
the  distributions.  Let  us  consider  linear  classification  procedures 
first,  and  limiting  the  discussion  to  two  classes,  define  linear 
procedures  formally  (the  classification  to  a  finite  number  of  classes 
will  be  accomplished  by  a  repetition  of  the  pairwise  procedure  an 
appropriate  number  of  times).  Let  b  (*  0)  be  a  column  vector  (of 
k  components)  and  c  a  scalar.  An  observation  (signal)  x  is  classi¬ 
fied  a  a  coming  from  the  first  population  if  b '  •  *<c  and  as  from  the 
second  if  b'  •  x  >  c  (the  symbol  '  stands  for  transpose).  The  vector 
b  and  the  constant  c  are  to  be  chosen  to  provide  maximum  discrimi¬ 
nation,  in  some  specified  sense,  between  the  two  classes.  An  ex¬ 
ample  of  such  a  procedure  which  was  originally  considered  by 
Fisher^*  and  the  method  of  statistical  analysis  which  was  developed 
from  the  solution  of  the  problem  is  called  discriminant  analysis.  A 
good  exposition  of  this  solution  is  given  by  Wilks.  7  Discriminant 
analysis  solves  the  following  problems:  Suppose  we  have  two 
samples  from  k-dimensional  distributions,  these  can  be  represented 
geometrically  as  two  sample  clusters  in  Euclidean  k- space. 

(Sample  1  contains  nj  points  and  sample  2  contains  points.)  We 
want  to  project  these  two  sample  clusters  orthogonally  onto  a  line 
so  that  the  variation  between  the  two  projected  samples  is  as  large 
as  possible,  relative  to  the  variation  within  the  two  projected 
samples.  The  terms  "variation  between"  and  "variation  within"  as 
they  are  defined  in  the  solution  are  presented  below.  Using  the 
notation  in  Wilks,  7  suppose  we  are  given  a  sample  of  size  n  from  a 
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will  be  called  the  internal  scatter  matrix  and  its  determinant. 

| U |  the  internal  scatter  of  the  sample.  It  will  be  noted  that  U  is  non¬ 
singular  if  and  only  if  the  n  point  in  the  sample  does  not  lie  on  a  hyper¬ 
plane  of  less  than  k-dimensions.  Now  suppose 


o> 


(1) 


) 


*1  =  1' 


‘1 


and  lJZ)  JZ)  \ 

and  (x^#  ...  x^) 


«2Sl* 


lx'  '  *  ' 

'it, . 

n_  are  two  .ample  a.  Letx'^  »  (r  . l,2be  the 

2  (1)  (2)  1  * 

vectors  of  means  and  U'  ,  U  the  internal  scatter  matrices  of  the 

two  samples  respectively.  Let  x=  (x^  ...»  x^)  be  the  vector  of 

sample  composed  of  the  two  samples  pooled  together.  And  let  Uw  * 

UW  *•  the  within- samples  scatter  matrix  for  the  two  samples. 

Note  that  geometrically  Uw  is  the  scatter  matrix  for  the  k-dimensional 

cluster  one  obtains  by  rigidly  translating  one  sample  cluster  with 

respect  to  the  other  (without  rotation)  until  the  means  of  both  samples 

coincide,  and  then  pooling  the  two  sample  clusters  together  as  a 

single  cluster.  Finally  let  U®*  U-Uw.  For  an  arbitrary  vector  (bj, . . .  b^) 

let 


£y 


l 

i*i 


h  _<y) 
bi  iif  ■ 


L  *  »• 


n 


1.2 


(9) 


or  using  vector  notations 

.0) 


s<;>  =  b‘ 

to  - 


_  (i) 

Thus,  , 


s 


and  x^\ 


.  JY) 

-to 

r(2) 

V 


except  for  scaling,  are 


one  dimensional  samples  obtained  respectively  by  projecting  the 
original  k-dimensional  samples  onto  a  line  whose  direction  cosines 
in  the  original  k-dimensional  space  are  proportional  to  (bj,  ...»  b^). 
See  Figure  14  for  k  =  2. 

Let  and  be  means  of  the  two  samples  of  z's  and  "z  the 
mean  of  the  pooled  samples . 


— .  ■  i  i 
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(11) 


‘.■l 

V  ■  1 

Note  that  if  S  is  the  scatter  of  the  grand  sample  obtained  by 
pooling  the  two  samples  of  z’s  then  S  *  Sw  +  Sg  *  Sw  is  the  within 
sample  component,  i.e. ,  the  sum  of  the  squares  of  the  nj  possible 
segments  (zy)  -  s'*)  )  generated  by  taking  %  ^  and  each  z\P  in  one 
*1  *1 

sample  summed  with  the  sum  of  the  squares  of  the  n^  possible 

segments  (z^  -  s  generated  by  taking  s  ^  and  each  In  the 
*2  *2 

other  sample,  and  is  the  between- sample  component  of  S.  It  is 
also  clear  that 

Sw  =  b'  UWb,  SB  =  b'  U®b  and  S  =  b*Ub.  (12) 

The  problem  is  to  determine  b1  =  (b^ . . . ,  b  )  so  as  to  maximize 

S_  (or  equivalently  to  minimize  S  /  (S,„  +  S_)  tor  a  fixed  value  of 
B  w  WB 

V- 

The  basic  results  concerning  the  solution  of  this  problem  may  be 

stated  as  follows:  The  value  of  (bj,  . . .  ,  b^)  say  (b* . b£)  which 

minimize  the  ratio 


Q  = 


S  +  S_ 

w  B 


so  that  S  has  a  fixed  value  D  t  0,  is  the  solution  of  the  equation 
w 

(U®  -  XJUW)  b  =  0  I 

where  Xj  is  the  nonzero  root  of  the  characteristic  equation 

|UB-XUW|=0,  thus  b*=  (UW)-1  •  (x  W  .  ^(3)  | 


Let  us  note  that  if  UW  =  I  (I  is  the  identity  matrix)  Q  takes  the 
following  form 
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(16) 


minimising  Q  while  keeping 


keeping 


IV 


bi2" 


*  D  t  0  ii  equivalent  to  maximising 
D  l  0  and  we  can  set  D  =  1. 


We  see  that  when  UW  =  I  the  problem  reduces  to  finding  a  line 

direction  b*  *  (bj, .  b^)  which  will  maximize  the  scatter  of  the 

projected  "points  of  the  two  samples  pooled  together.  The  solution 

to  this  problem  is  Principal  Component  Analysis  and  b'*  =  (b  , . . .  ,  br) 

is  the  eigenvector  corresponding  to  the  largest  eigenvalue  in  tne 

solution  of  the  equation  (U  -  XI )b  =  0.  We  notice,  as  mentioned  before, 

that  this  last  equation  is  identical  to  the  equation  which  is  solved  to 

get  the  eigenvectors  of  the  Loeve-Karhunen  expansion.  If  the  Loeve- 

Karhunen  method  would  have  been  used  to  obtain  the  optimal  line  to 

project  our  samples  on,  for  the  purpose  of  discrimination,  this  line 

would  coincide  with  the  one  obtained  by  the  method  of  discriminant 

w 

analysis  when  the  within.-sample  scatter  matrix  U  is  the  identity 
matrix.  Looking  at  it  geometrically,  what  discriminant  analysis  tries 
to  do  is  to  project  the  two  samples  onto  a  line  so  that  the  means  of  the 
two  one -dimensional  samples  of  points  are  as  far  as  possible  relative 
to  the  within  sample  scatter  of  these  two  one -dimensional  samples. 

The  Loeve-Karhunen  method  tries  to  project  the  pooled  two  samples 
on  a  line  so  as  to  maximize  the  distance  among  all  the  projected 
points.  Since  Uw  =  I  corresponds  geometrically  to  a  spherical  within- 
group  scatter,  it  is  clear  that  in  this  case  making  all  the  projected 
points  in  the  pooled  sample  as  far  as  possible  from  each  other  is 
equivalent  to  maximizing  the  distance  among  the  means  of  the  two 
projected  samples.  See  Figure  15. 
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2.  The  Use  of  Prewhitening  Filters 


The  arguments  presented  in  Section  VIII-  1  suggest  that  solving 
the  discriminant  analysis  problem  is  equivalent  to  first  making  a 
change  of  coordinates,  such  that  in  the  new  set  of  coordinates  the  within 
sample  scatter  matrix  is  spherical,  and  then  solving  the  Loeve -Karhunen 
problem  i.  e. ,  finding  the  eigenvector  corresponding  to  the  largest 
eigenvalue.  We  shall  derive  the  appropriate  change  of  coordinates  to 
accomplish  this. 


If  we  consider  each  sample  vector  as  (*,*  - 

(y)  #(y)  .  (y)’  .  ,  e  . 

,Xk^  )  "  %  Y  -  *.  -  l,...«yt.e..  i 


2 


Y=l,2,  n^,  i.  e. ,  subtract 


from  each  sample  vector  the  sample  mean,  and  denote  the  (n^-fn^)  * 
k  matrix 


*2 

*1^+1 


*  2 

Xfc  i+t 


•  2 

’‘‘W 


then  U  =  X'X. 


knl+n2 


If  we  make  a  change  of  coordinates  such  that  the  vector  x  '  ”  is 

(y)  (y) 

transformed  to  the  vector  y  the  transformation  X.^y  ~ 

=  (UW)  *  x  then  Y  ~  X  [(UW)  ^  ]’  and  if  we  denote  the  within 

*  ^  w 

scatter  matrix  in  the  new  coordinate  system  as  V  we  have 

VW  =  Y’Y  =  (UW)^  X,Xt(UW)'*J'  =(UW)"^(U*r)  *(lT)  I  (18) 


We  conclude  that  the 
the  within-group  scatter 


ie  appropriate  change  of  coordinates  for  making, 
r  spherical  is  the  transformation  Y-£y  =  (UW) "  “  , 
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In  the  language  of  communication  engineer*,  the  linear  trans¬ 
formation  [  U  j  "  a  is  called  a  prewhitening  filter  because  of  the 
terminology  of  white  noise  used  to  denote  noise  which  has  a  Covariance 
matrix  I. 

The  fact  that  O.  A.  cam  be  accomplished  by  prewhitening  followed 
by  L-K  can  also  be  demonstrated  as  follows: 

The  optimal  line  of  projection  b*  is  found  by  the  method  of  D.  A. 
as  the  eigenvector  corresponding  to  the  largest  (for  two  groups  the 
non -zero)  eigenvalue  of  the  characteristic  equation 


1(11®  - 

XUW)|  =  0 

(19) 

or  the  solution  of 

(U®-  X 

w 

jU  ) b  =  0 

(20) 

and  then  we  obtain 

ii 

M 

*»  (Y) 
b  •  x 

-iy 

(21) 

*  B  w 

where  both  x  and  b  are  column  vectors.  Since  U  =  U  -  U  , 

Eq.  20  can  be  rewritten  as  (U  -  (1  +  Xj)UW)b  =  0. 
or 

[(UW)^U(UWr^{l  +  xi>1]  W")**  =  0  (22) 

w  V 

setting  (1  +  X)  -  X  *  and  (U  )*b  =  c,  we  obtain 

[(UW)^U(UW)'^-(X1’)l]  c  =  0  (23) 

w  -A 

Since  by  prewhitening  we  multiply  each  incoming  signal  by  (U  )  2  A 

the  internal  scatter  matrix  of  the  grand  sample  becomes  (Uw)”tu(U  )"2 
if  we  now  apply  L-K  (or  equivalently  principle  components)  analysis 
and  pick  the  eigenvector  corresponding  to  the  largest  eigenvalue  of  the 
characteristic  equation 

J(UW)”2U(UW)‘^-  X'l)  j  =  0  (24) 

we  obtain  the  optimal  line  to  project  on,  c  ,  as  the  solution  of  the 
equation 
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and  we  aee  that  except  for  (poaaibly)  a  ecale  factor  we  have 
t  wl  * 

c  =  (U  )*b  .  (26) 

We  note  that  since  each  of  the  incoming  vectors  was  multiplied  by 
w 

(U  )  *  the  z's  (except,  possibly,  for  a  scale  factor)  will  become 


(Y)  *' 

sv  =  c 

t  Y  - 


<un 

i.  e. ,  the  same  as  Eq.  21. 


JYK  =  K*'/rrw\i.  — w* 


x-)  =  b  (U  y 


(U~) 


S' 


iv) 


-tv  -  -tv 


(27) 


It  is  appropriate  to  point  out  that  if  we  think  of  one  of  the  classes 
as  being  "noise"  and  the  other  as  being  "signal  plus  noise"  (again  using 

k 

communication  engineering  terminology)  and  identify  ^  b^  x^  as  the 

i=l 

output  of  a  discrete  filter,  then  the  ratio  Q  maximized  in  the  discriminant 
analysis  problem  is  seen  to  be  "signal-to-noise  ratio".  Thus,  coefficients 
bj*  define  the  filter  which  maximizes  the  signal -to -noise  ratio. 

3.  The  Singularity  Problem 


For  linear  classification,  discriminant  analysis  is  used  to  reduce 
the  dimensionability  of  the  sample  space  to  one  dimension  and  a  threshold 
is  used  to  classify  £he  sample  points.  The  optimal  direction  to  project 
the  sample  points  b  is  given  in  terms  of  the  inverse  of  the  within  samples 
scatter  matrix.  The  problem  of  singularity  of  this  matrix  arises  when 
discriminant  analysis  is  applied  to  a  problem  with  a  high  dimensional 
sample  space  and  a  small  sample  size.  The  internal  scatter  matrix  is 
nonsingular  if  and  only  if  the  n  points  in  the  sample  do  not  lie  on  a  hyper¬ 
plane  of  less  than  k -dimensions,  k  being  the  dimensionability  of  the 
sample  apace.  If  n^  +  n^  <  k  +  2,  n^  and  being  the  number  of  samples 
for  group  one  and  group  two  respectively,  the  matrix  Uw  is  singular, 
i.  e.  ,  its  inverse  does  not  exist. 

In  biological  classification  problems  the  sample  size  may  be  small 
relative  to  the  dimensionality  of  the  sample  space.  Since  the  Loeve- 
Karhunen  method  is  optimal,  in  the  mean  square  sense.,  for  reduction 
of  dimensionality,  it  can  be  used  for  tackling  small  sample  problems 
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in  a  high  dimensional  space.  The  approach  would  be  to  reduce  the 
dimensionality  of  the  sample  space  using  the  Loeve-Karhunen  method 
to  a  point  where  Ri  +  n2  "  2  ia  larger  than  the  dimensionality  of  the 
sample  space  and  then  apply  discriminant  analysis. 

There  are  a  few  other  methods  which  have  been  suggested  for 
the  solution  of  the  small  sample  problem  which  results  in  singularity. 
One  such  method  is  to  use  the  Moore-Penrose  Generalized  Inverse^* 
for  inverting  the  singular  within-sample  scatter  matrix.  This  method 
effectively  restricts  our  search  for  best  projection  line  to  a  subspace 
of  the  sample  space,  which  is  orthogonal  to  the  subspace  in  which 
there  is  no  scatter.  Thus,  one  can  eliminate  "good”  directions  this 
way  and  examples  can  be  constructed  in  which  this  is  just  the  wrong 
thing  to  do.  Another  suggestion  is  to  replace  the  zero  eigenvalues  of 
Uw  by  the  average  eigenvalue  thus  making  the  matrix  nonsingular 
Or  more  precisely,  if  Q  is  the  matrix  whose  columns  are  the  eigen¬ 
vectors 


D  =  DIAG  <X, ) 
I  n 


trace  D  =  sum  of  the  elements  on  the  main  diagonal  of  the  matrix  D 

m  =  total  number  of  samples 

n  =  dimension  of  space 

I  =  identity  matrix 
|| 

A  will  be  referred  to  as  'H'  inverse  (*H'  standing  for 
T.  J.  Harley  who  suggested  it). 

4.  Principal  Component  Analysis ^ 

The  Loeve-Karhunen  expansion  theorem  states  that  a  random 
process  in  an  interval  of  time  Q  may  be  written  as  an  orthonormal 
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series  with  uncorrelated  coefficients.  The  expansion  is  in  terms  of 
the  eigenfunctions  of  the  autocovariance  function  of  the  process. 
When  the  statistics  of  the  process  are  unknown  and  a  sample  auto¬ 
covariance  is  taken  as  an  estimate  of  the  autocovariance  function  of 
the  process,  an  expansion  in  terms  of  the  eigenvectors  of  the  sample 
autocovariance  matrix  is  identical  to  Principal  Component  Analysis 
in  multivariate  statistics. 


Consider  a  sample  of  size  n  from  a  k -dimensional  distribution, 
k  <  n.  This  sample  may  be  represented  geometrically  as  a  sample 
cluster  of  n  points  in  k-dimensional  Euclidean  space  R,.  Suppose 
we  wish  to  project  this  cluster  orthogonally  onto  an  s -dimensional 
Euclidean  space  R#,  s  <  k,  so  as  to  obtain  the  greatest  possible 
s -dimensional  scatter  of  the  projected  points  (the  term  scatter  will 
become  clear  in  the  solution  below).  The  problem  is  to  determine 
the  direction  of  projection  with  respect  to  the  coordinate  system  of  R^. 


The  solution  of  this  statistical  problem  can  be  stated  in  the 
following  result  due  to  Hotelling  (1933): 


Suppose  (x^£  ,  . .  . ,  x^t  i  £  *  1 . n)  is  a  sample  of  size  n  >k 

from  a  k-dimensional  distribution  whose  covariance  matrix  is  positive 
definite.  Let  j  ju. .  ||  be  the  internal  scatter  matrix  of  this  sample,  i.  e. , 

n  n 


i  si  e=i 


t. ,  ,  i  =  1,  2,  .  . . ,  k  where  x  =  —  /  x._  ,  i  =  1,  2,  ...»  k 

i£  i  “  L  i£ 


and  let  it  be  positive  definite  with  probability  1.  Let 

(c  j  i  •  •  •  »  )*  p  *  li .  .  .  i  •  .  ( 28) 

P  P 

s  k-dimensional  unit  vectors,  that  is,  such  that 
1 ,  .  .  .  ,  s,  and  set 


£  -  !»•••»  n). 


be 
P  = 
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The  value*  of  the  vector*  which  maximise  the  scatter  fu  |  are 

the  solutions  of  the  s  sets  of  equations  *** 

k 


X 


(u..  -  X  b.t)  c.  =  0. 

ij  P  U  JP 


1  ■  1 . k 


(30) 


p  =  1 . s  where  X^ . Xgl  are  the  s  largest  roots  of  the  char¬ 

acteristic  equation 


-  0.  (31) 

being  the  Kronecker  6,  and  where  \\>  .  . .  >  Xg  with  probability  1. 
Furthermore,  these  vectors  are  orthogonal,  and  the  maximum  value 
of  |u  )  is  the  product  XjX£. .  ,Va.  The  proof  of  the  above  statement 
can  t/e*found  in  Ref.  7. 


Both  the  above  analyses  are  essentially  a  least  mean  squares 
linear  fit  of  a  set  of  S  orthonormal  waveforms  to  a  set  of  n  observed 
waveforms.  We  also  note  that  Principal  Component  Analysis  is  some¬ 
times  referred  to  as  Principal  Factor  Analysis.  23 

5.  The  Choice  of  Threshold  in  Linear  Classification  Problems 


In  linear  classification  procedures  an  observation  (signal)  x  is 
classified  as  coming  from  the  first  population  if  b'  •  x  <  c  and  as  from 
the  second  if  b'  *  x  >  c.  The  vector  b  and  the  constant  c  are  chosen  to 
provide  maximum  discrimination,  in  some  specified  sense,  between  the 
two  classes.  In  this  section  the  choice  of  c  will  be  discussed. 

Let  us  first  assume  that  b'  *  x  has  one  of  two  possible  distributions 

N(  v.y  or  N  (p  (L,^)  i.  «•  .  normal  with  mean  p^  and  variance 

Cj2  or  normal  with  meanp^  an<*  variance  (J^2.  We  lose  no  generality 
if  we  letp^  >  P  1  and  designate  the  population  I.  Let  us  denote  by  Lj 
the  loss  associated  with  the  misclassification  of  an  individual  from 
population  I  and  by  L^  the  loss  assc-iated  with  misclassification  of  an 
individual  from  population  II,  Lj,  L^>  0.  Let  us  further  denote  the 
a  priori  probability  of  population  I  by  p  and  of  population  II  by  q  =  1 -p. 
Let  Pj  be  the  probability  that  a  random  observation  (signal)  from 
population  I  is  classified  as  having  arisen  from  II,  and  Pjj  the  probability 
that  a  random  individual  of  II  is  classified  as  having  arisen  from  I.  We 
wish  to  obtain  the  constant  c  which  minimizes  the  expected  loss,  i.  e.  , 
min  [  LjpPj  +  L2  qPnJ. 
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’« \etJ 


»vc 


c  “  V- 


-  x2/z 


2  2/, 

-  -  x  /  2 


2  • 


Taking  the  partial  derivative  with  reepect  to  c  wa  obtain 


KLjpPj  +  l^qPjj) 


[■ 


<\yi f 


,  L2q 

+  — —  ■  —  exp 

a2/2i“ 


[•i 


c  -h 


Equating  the  derivative  to  aero  and  rearranging,  we  obtain 

L2q<7  1  *V  C  2  C"»l2  2 

2In  Hr  +  (~  )  -  -  0 

1P  2  1  2 

which  is  a  quadratic  in  c  with  the  following  roots 
*  *  1*1*2 


,  L  p(T  -,1/21 

2(02  -  0,T  In  | 


There  are  three  possibilities  for  the  roots  of  equation  35.  There 
are  no  real  roots  (Fig.  16)  no  roots  fall  in  (pj.p  \  interval  (Fig.  17) 
one  and  only  one  root  falls  in  (pj,<j  ,)(Fig.  18).  Il  a  root  should  fall 
at  one  of  this  may  be  co.isiHered  as  a  limiting  case  of  the 

situation  when  no  roots  fall  in  (p^.p^).  When  there  are  no  real  roots 
the  situation  is  trivial,  and  all  individuals  are  classified  into  one 
population.  When  no  roots  fall  in  (pi»p^  linear  discrimination  is  not 
very  helpful,  and  quadratic  discrimination  is  indicated.  Let  us  con¬ 
sider  the  situation  when  one  and  only  one  root  falls  in  (pj.p^).  When 
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Ml  Mz 


Fig.  16  No  Real  Root* 


Ml  M 2 

Fig.  17  No  Roots  in  (|*j,  Interval 


Ml  M2 


Fig.  18  One  Root  in  Nerval 
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a  root  falls  in  (p},^)’  *****  *•  **»e  root  which  minimises  (LjpPj  +  L^q  Pjj») 
and  is  therefore  the  root  desired.  The  other  root  maximises 
(LjpP|  +  L^qPjj)  and  therefore  will  not  b«  used.  We  also  notice  that 
when  Dj  <  (?2  the  root  which  falls  in  (pppj*)  **  ***e  *arfer  of  the  two 
and  when  (J^<  (Jj  it  is  the  smaller,  thus  in  both  cases  the  positive 
square  root  is  required. 


Let  us  now  consider  the  case  when  neither  a  priori  probabilities 
nor  loss  functions  are  known.  There  can  be  two  criteria  for  choosing 
c.  One  would  be  the  c  which  minimises  the  sum  of  the  two  types  of 
errors,  and  the  other  which  minimise  the  larger  of  the  error  quantities. 


Consider  now  the  minimising  of  max  (P^,  PU>-  .  Since  Pj  and  PH 
are  monotonic,  decreasing  and  increasing  respectively,  in  c,  the 
desired  c  is  located  such  that  Pj  =  PU*  From  Eq.  32  and  Eq.  33  we 
obtain 


thus. 


(37) 

(38) 


Solving,  we  obtain 


“J 

V°1 


(39). 


Setting  Lj  -  =  p  =  q  =  1  in  Eq.  36  we  obtain  (taking  the 

positive  square  root)  the  c  which  minimises  (  Pj  +  Pjj] 

_  1 


C2  = 


_  2  2 
Q  -0 
2  1 


>  (40) 


It  should  be  noted  that  if  0^  =  0^,  the  c's  under  both  criteria 
reduce  to  a  c  dependent  upon  only  the  centroids 


(41) 
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Under  tome  condition*  pj  +  p2/2  i#  a  *ood  comPromi,e  between 
C|  and  c3.  In  Ref  (21)  Riffenburgh  and  CIunie*-Ro*a  find  the  condition* 
under  which  c^  i*  better  than  c^  and  condition*  under  which  C£  i* 
better  than  c3.  In  the  B,  A  plane  where  B  *  O-j/ff j  and  A  * 
they  find  four  region*,  in  region  (1)  ro  linear  discriminator  is 
reasonable,  in  region  (2)  c3  is  a  compromise  between  cj  and  c^,  in 
region  (3)  c^  is  better  than  Cj,  and  in  region  (4)  both  c %  and  cj  are 
better  than  c3. 

The  appropriate  threshold  c  is  chosen  according  to  the  above 
considerations. 

6.  Processing  of  Stochastic  Signals 

As  described  in  Section  VI,  a  commonly  used  procedure  for 
analysing  stationary  stochastic  processes  is  to  first  compute  the 
power  spectrum  or  autocorrelation  function  of  each  signal  in  question, 
and  then  to  use  either  of  these  functions  as  the  signal  to  be  analyzed, 
rather  than  the  original  time  signal. 

A  method  will  not  be  presented  which  yields  the  same  results  as  the 
above  mentioned  standard  procedure,  but  which  eliminates  the  need 
for  computing  the  power  spectrum  or  autocorrelation  function  of 
each  signal. 

Consider  the  standard  procedure,  given  a  stochastic  process 
x(t).  First  the  autocorrelation  function  t  (T)  is  computed.  Assume 
the  maximum  correlation  span  of  the  process  to  be  T/2.  Then  an 
auxiliary  function  f(t),  0  <  t  <  T,  can  be  defined  such  that  f(t)  =  ^  (t-T/2) 

These  auxiliary  functions  f(t)  then  are  used  to  define  a  basis  of  eigen- 
functions  which  are  solutions  of  the  equation 

T 

J  *ff(t  -  TH.(T)dT  =  X.^t)  (42) 

o 

The  coordinates  of  any  sample  in  the  subspace  thus  defined  are  given 
by 

T 

Ci  =  ^  *  (t)  i|i.(t)  dt.  (43) 
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The  following  procedure  has  been  developed  so  that  the 
coefficients  can  be  determined  without  computing  the  autocorrelation 
i(t)  =  ♦xx(t“T/2)  for  each  sample  function. 

First  consider  that  the  cross -correlation  between  input,  x(t), 
and  output,  y(t),  of  any  linear  time  invariant  system  with  weighting 
function  h(t)  is  given  by  ^ 

♦xy(T)= 

7  Y.CO 


T 

=  J  gu.T)h(u)du, 


since  +  is  an  even  function,  and  it  is  assumed  that 


h(u>  =0 


u  <0 
u  >T 


But,  from  above,  #  '(u  -  T)  =  £(u  -  T+  T/2) 
so  that 

„  T 


*xy(T)  =  I  f<u-T+T/2>h(u>du* 


and  therefore 


VT/2,=  1 


f  (u)  h (u)  du  . 


.th  ... 


Then,  if  the  weighting  function  h(u)  is  taken  to  be  the  i  eigenfunction 
4^(u),  the  right-hand  side  of  45  becomes,  by  43 

ci  =  f  f(u)4/.(u)du 
Jo 

The  left-hand  side  of  45  by  the  definition  of  the  cross -correlation 
between  stationary  random  functions  is 


♦  (T/2)  =  lim  f  x(t-T/2)y  (t)dt  =  ^t'-T/ZJy  I 

Y;  O-*  oo  ^  1 
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Finally,  therefore,  because  of  the  quality  between  43,  45,  and 
46,  the  coefficient  c^  can  be  evaluated  as  shown  in  Fig.  19  by  averaging 
(for  a  "suitable  length"  of  time,  3)  the  product  of  the  output  and  delayed 
input  of  a  filter  whose  weighting  function  is  the  i^  eigenfunction. 


Figure  19  -  Time  Domain  Processing 

An  experiment  was  conducted  to  provide  a  direct  comparison  of 
the  standard  procedure  and  the  proposed  time  domain  method. 

Computer -generated  random  signals  were  used  as  the  data  set,  with 
20  samples  of  each  class  used  for  training  and  20  for  test.  The  results 
on  test  data  are  shown  in  Fig.  20.  The  results  show  the  time  domain 
method  to  be  significantly  more  effective  in  separating  the  test  samples 
(figures  of  merit  =  6.  25  vs.  4.  5  for  the  standard  procedure.) 

7 .  Another  Proof  for  Theorem  4.  1,  Ref.  1 
Theorem  4.  1 


Two  probability  distributions  {p^  and  {  Xj}  satisfying,  pj  >  0, 

oc  Ot* 

X  ,  >  0,  Zp.  =1,  Z  X  =  1  are  connected  by  a  double  stochastic  matrix 
J-  i=l  1  j=l  J 


A.,  such  that  p.  =  Z  A..  If  the  labeling  of  the  X.  is  done  in  a 
‘j  j=l  U  J  *  J 

descending  order  Xj  >  X ,  >  .  .  .  then,  for  an  arbitrary  n,  the  sum  of  the 

first  n  elements  of  fx.f  is  not  less  than  the  sum  of  the  first  n 
.  r  r  i  J 1 


first  n  elements  of 

elements  of  { ps} 

*  J 

Proof: 

oc  n  * 


not  less  than  the  sum  of  the  first  n 


n  n 


Z  p.  *  Z,  .Z,  A..  X  =  z  Z  A..  X  *  z  Z  A.  .X  (47) 
i=l  i  =  l  j=l  U  J  i=lj=l  U  j  i=l  J=n+1  0  i 


*  This  proof  was  suggested  by  S.  Winograd. 
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Since  \,>  X  >  .  .  .>  X  >X  >  ... 
1“  2—  —  n  —  n-f  1  — 


n 


00 
£ 

p=l  P 


£  oX  >  X  for  any  a  >0  and  1  « 

p=l  P  P  n+1  "  —  -  ■ 

Let  a  * 


=  I 


so 

i^n+1  ^ip 


p  n  sc 

P  I  1  A. 

p=l  i-n+1  lP 

note  that  since  A  is  doubly  stochastic 


(48) 


(49) 


ot  oo  n  « 

£  £  A.  =  £  £  A  . 

p*-l  i=n+l  lP  p=l  i=n+l  Pl 


ot 

.  £, ,  A. 

i=nl 1  ip 


n  sc 
I  £  A  . 
p=  1  i  =  n+l  Pl 


(50) 


(51) 


by  (47)  and  (48). 


oc  n  n  n  qo  n 

1-1  1  j-l  *=1  j  i=l  j=n+l  ij  p=  1  p  i 

Substituting  (51)  we  get 


n  n  n  sc 

=  £  X  £  A..  +  £  £  A.  X 

j=l  J  i  =  l  lJ  p=l  i=n+l  lP  P 


(52) 


n  n  go 

£  X  £  A  4  £ 
>=1  P  i=  1  ip  i=n+l 


Ai, 


X 

P 


n  n 

£  p.  <  £  X 
i=l  l~p=l  P 


Q.  E.  D. 
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•SECTION  DC 


CONCLUSIONS  AND  SUGGESTIONS  FOR  FUTURE  WORK 


The  L-K  and  Discriminant  Analysis  methods  studied  provide  a 
promising  basic  methodology  for  computer -assisted  analysis  and 
classification  of  biological  data.  When  properly  used,  the  computer 
programs  which  have  been  developed  should  become  useful  tools  in 
the  hands  of  research  workers  in  the  biological  sciences.  It  would, 
however,  be  prudent  to  consider  them  still  as  experimental  tools 
until  they  have  been  exercised  and  ey^luated  on  much  more  data. 

It  is  suggested  that  future  research  be  devoted  to  the  consider¬ 
ations  of  methods  of  analysis  based  on  other  than  the  mean-square 
1617  7  18 

criterion  ,  and  on  extensions  to  multiple  time  signals  . 
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